DISCOVERING INTERESTING STATEMENTS FROM A DATABASE

F. GEBHARDT
D-53731 Sankt Augustin, Germany

SUMMARY 

Knowledge discovery aims at extracting new knowledge from potentially large databases; this may be in
the form of interesting statements about the data. Two interrelated classes of problems arise:
to put the subjective notion of 'interesting' into concrete terms and to deal with large
numbers of statements that are related to one another (one rendering the other redundant or at least less
interesting). Four increasingly subjective facets of 'interestingness' are identified: the subject field under
consideration, the conspicuousness of a finding, its novelty, and its deviation from prior knowledge. A
procedure is proposed, and tried out on two quite different data sets, that allows for specifying
[...] in a way that takes interestingness [...]
and the relation between statements (similarity, affinity) into account: manifestations of the second
and third facets of interestingness in the given data environment.



KEY WORDS  Knowledge discovery in databases  Project EXPLORA  Exploratory data analysis  Interestingness



1. DISCOVERING STATEMENTS 

During the past years, knowledge discovery in databases has attracted growing attention. The
aim is to extract new knowledge from data sets in the form of hidden dependencies that may
hold either in all cases or in a statistical sense; prior knowledge guides or supports the discovery
process in various ways.

An example is project EXPLORA.1-5 Statistical dependencies are found and presented to
the user in the form of textual statements indicating, for example, subpopulations (groups
defined by explanatory variables) that differ significantly from the overall average; on request
the text is augmented by the relevant figures. Background knowledge consists in this case
mainly in the ordinal or hierarchical structure of variables and in the distinction between
explanatory variables and variables to be explained. This results in presenting only the most
comprehensive statements, suppressing subsets as redundant or uninteresting. EXPLORA is
able to handle statements composed of one, two or more variables; thus the system has to
construct vast search graphs and, once a significant statement has been found, to cut off its
subgraph; this is not straightforward, since sub-subgraphs may have many other ancestors.

While finding remarkable results quite rapidly, this concept suffers from some drawbacks.

The subpopulation (group of objects) found by the search algorithm is formally significant
by construction, but need not really be interesting; the real cause for the apparent significance
may be a subgroup which is highly significant while the rest of the subpopulation is more or
less inconspicuous.

There may be a number of similar statements where the supporting groups are nevertheless
not subsets of one another, such as 'income above 3000, age above 30', 'income above 3500,
age above 25', 'income above 4000, age between 20 and 60'. All of these will be presented,
and occasionally this amounted to about a dozen similar statements tending to confuse the
user.

There may be several statements that are formally completely distinct but nevertheless 
express more or less the same constellation, the cause being strong relations between certain 
values of different variables like 'retired persons' and 'persons above 60'.

A solution to the first problem could be always to check the next lower level and to present 
the higher level only when the lower one is rather homogeneous; in cases of doubt, both levels
could be shown. This would further increase the number of results. 

Procedures of this kind reduce the mass of original data to a set of selected statements, 
possibly a large set so that the user wishes to reduce it further according to his (or her) 
interests. This leaves us with two sorts of problem. 

The search algorithm may find several statements that in fact express the same constellation: 
selecting just the strongest result is inadequate since the strength is subject to statistical 
variation; there is a fluid transition from 'same constellation' to 'unrelated'. How should one
select the most important results?

What does 'interesting' mean; which tools could enable the user to express personal,
subjective directions of interest?

We shall identify some increasingly subjective facets of interestingness and then present an
algorithm handling a set of related statements with the aim of emphasizing those that are
presently the most interesting ones. Roughly speaking, a statement that appears to be less
important is devalued and pushed down in the ordered list of noticeable statements. It does 
not get lost, but others are given priority. 

2. INTERESTINGNESS 

2.1. Broad and narrow meanings

The word 'interesting' is used in many papers, but the authors rarely say what they mean
by it. If an explanation is given, it tends to signify the degree by which a result deviates from
average or normal. Thus Piatetsky-Shapiro 3 measures the interestingness of a rule $A \to B$
discovered in a database essentially by the function

$$|A \cap B| - |A|\,|B| / |M|$$

after some normalization, where $A$ and $B$ are two subpopulations of population $M$ and $|\cdot|$
denotes as usual the size of a set.
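The unnormalized measure can be illustrated with a minimal sketch. The function name `rule_interest`, the representation of subpopulations as Python sets, and the toy numbers are assumptions made for illustration only; the normalization mentioned in the text is not included.

```python
def rule_interest(a: set, b: set, population_size: int) -> float:
    """Unnormalized interest of a rule A -> B: |A & B| - |A| * |B| / |M|.

    The value is 0 when A and B are statistically independent and
    positive when they co-occur more often than expected.
    """
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    return len(a & b) - len(a) * len(b) / population_size


# Hypothetical toy population of 10 objects, numbered 0..9.
A = {0, 1, 2, 3}
B = {2, 3, 4}
print(rule_interest(A, B, 10))
```

A value near zero says that the overlap of the two subpopulations is about what independence would predict, which is why the measure expresses deviation from "normal".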

EXPLORA in its present implementation utilizes a broader concept. Interestingness is based 
on the degree of deviation from average or normal, but redundant statements, i.e. statements 
following from another interesting statement, are considered uninteresting. The same holds
true for statements that are to be expected once an interesting statement is presented; this
usually means that statements concerning a subgroup are considered uninteresting if the
corresponding statement on the group is already exhibited.

A short paper by Lebowitz* delves into the question of when a short newspaper notice is
interesting. The goal is not to store uninteresting parts from the very beginning. He finds three
grounds for non-interest: (1) concepts not causally connected to the main point of the story;
(2) [...] (although their absence may be interesting); and (3) concepts that are
overshadowed, such as the death of the driver if an important person has been shot.

Later, Lebowitz applies these ideas in the machine learning system UNIMEM combining
explanation-based and similarity-based learning: the interest in certain features of a story
partially guides the search process for generalizations. He stresses the heuristic dimension of
what is interesting. The paper cites other sporadic attempts to put the notion of interest in
concrete form.

But interestingness can be still more diffuse and rather intangible, even idiosyncratic. A story
is interesting if one knows some of the persons involved. A telephone number is interesting
to a mathematician if it starts with the same digits as the number π. Sometimes, the interesting
part of a message (an election result, a hit list, a company telephone directory) is that an
expected result or entry is missing.

Certainly, an automated knowledge discovery system cannot anticipate such individual or
even queer connections that arouse interest; but one should stay aware that they exist. Any
system supporting the user in finding interesting facts should therefore offer sufficient means
for browsing and for directing the search.

We concentrate now on aspects of 'interesting' that are related to the goal of knowledge
extraction from databases and are susceptible to some kind of formalization.

2.2. Aspects of interestingness

Given a big and unfamiliar database, what features or aspects or facets of interestingness
can be utilized to guide or delimit the search for interesting pieces of knowledge?

Firstly, the subject field under consideration determines a broad boundary of what is
interesting. This aspect amounts to choosing a proper database and probably a
subset of the data such as certain groups of variables or objects. We include here also the
(temporary) selection of a type of dependencies. In EXPLORA, this is defined by an
uninitialized validation function and the choice of variables to be inserted into it. Connected
to this function is the linguistic template for stating the results found by the search algorithm.

Secondly, the conspicuousness or evidence of a finding delimits the degree of interestingness:
any result must be unusual, unexpected, or important according to some predefined criterion.
We assume that the criterion not only determines whether a finding (a statement) is true and
worth recording, but also yields a numerical value for the degree of conspicuousness. We shall
see later that this first starting point may be modified in order to take other facets of
interestingness into account.

[...] therefore [...] 'significance'. In our examples, we try to characterize a given subset of a
set of objects by the values of selected descriptive variables; the conspicuousness will be based
on the sizes of the sets involved. There may be dozens of such characterizations of the given
subset yielding similar values of conspicuousness.

Thirdly, the novelty or dissimilarity of a statement with respect to other results found at the
same time influences the interestingness. If a statement on a group of objects is automatically
or usually also valid for a subgroup, the latter one is redundant and therefore uninteresting.
If two conspicuous groups overlap to a large extent, one of them could suffice for describing
the situation. However, if the goal is to find an explanation, both statements should probably
be considered.

We propose an algorithm ranking the statements in such a way that the conspicuousness as
well as the dissimilarity to all previously exhibited statements is taken into account.

Fourthly, the deviation from prior knowledge is an important ingredient of interestingness.
However, this aspect is hardly practicable. The computer cannot assess the user's prior
knowledge. Even if it could, the latter might have changed: the user has learned some new facts
and forgotten others. Moreover, suppressing an outstanding, but known, result could make
the user believe it does not hold. We do not, therefore, consider this aspect further.

These aspects are not independent of each other. For instance, prior knowledge could be 
incorporated into the evaluation function. In that case the user knows what assumptions are 
already taken into consideration and wrong conclusions should be avoided. 

It is clear that interestingness is a highly subjective matter. Different persons will almost
certainly disagree on the relative interestingness of a bunch of statements. We nevertheless
propose a procedure for extracting interesting results, but there are two safeguards: the
algorithm provides parameters for individual tuning, and no conspicuous statements become 
lost in an absolute sense, they are merely rearranged in a rank order originating from that 
procedure. 



2.3. Test data

The ideas presented in this paper have been extensively tried out on two test data sets:
election data and financial services data. 

For the Federal elections in 1987, Germany was subdivided into 248 constituencies.
According to law, these have to have approximately the same number of persons entitled to
vote; the existing differences in size have therefore been neglected.

There were four major parties, CDU (in Bavaria CSU), SPD, FDP and 'Die Grünen'. Their
election results as well as the differences from the previous election (in 1983) serve as the
variables in which the user is interested, for which he wants to find conspicuous results. More
specifically, one strives to characterize the 50 (or 30 or 62, for example) best or worst election
districts by one or two of the ten demographic variables.

Most variables including the election outcomes are originally given as percentages; examples 
are portions of unemployed, of persons employed in agriculture, of persons above 65 years of 
age; the exceptions are Bundesland (State) and population density. 

For each demographic variable (except Bundesland) some class boundaries have been set in
advance such as 11.5%, 12%, 12.5% and 13% for young persons (18 to 25 years). Only the
sets of constituencies falling below or above such a boundary have been used in describing the
best or worst districts of a party. Thus a sample statement for the best 62 districts of SPD
(election result) is 'Agriculture < 5%, unemployment > 12% (18 districts, 16 of them
belonging to the 62 best ones)'.

The second data set is a survey on 20000 people regarding their affiliation with bank 
services. The demographic variables include hierarchic ones such as geographical region and 
occupation, numerical ones such as income and age, other ordinal variables such as school 
education and nominal ones such as sex and marital status. 



2.4. Examples for the evidence

Two quite different evidence measures have been used, based on the following contingency
table:

              Goal objects    Non-goal objects
Group         $n_{11}$        $n_{12}$
Complement    $n_{21}$        $n_{22}$

When one is looking for groups with a high proportion of goal objects, the basis for the
evidence is $n_{11}/(n_{11} + n_{12})$. The maximal value is 1; it is also attained for a group consisting of
a single goal object. To give small groups a disadvantage, our first evidence function $V_1$ is

[...]

In addition, we restricted the search to groups with at least 8 or 10 objects (constituencies). 
This evidence favours groups with a high portion of goal objects; these groups are usually 
small, the best ones containing mostly ten to twenty constituencies. 

The second evidence function is the usual $\chi^2$. Let $e_{ij}$ be the expected value of cell $(i, j)$ given
the row and column sums; then

$$V_2 = \chi^2 = \sum_{i,j} (n_{ij} - e_{ij})^2 / e_{ij}$$

This function favours large groups since they are statistically more significant. A good
typical statement (50 goal objects) refers to a group with 40 to 50 objects, about three quarters
of them being goal objects.
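A minimal sketch of the chi-square evidence for the 2x2 table of group versus complement and goal versus non-goal objects follows. The function name `chi_square_evidence` and the example counts are illustrative assumptions; only the total of 248 constituencies and the 50 goal objects are taken from the text.

```python
def chi_square_evidence(n11: int, n12: int, n21: int, n22: int) -> float:
    """Pearson chi-square of a 2x2 table: sum of (n_ij - e_ij)^2 / e_ij.

    Rows are group / complement, columns are goal / non-goal objects.
    e_ij is the expected cell count given the row and column sums.
    """
    table = [[n11, n12], [n21, n22]]
    total = n11 + n12 + n21 + n22
    row_sums = [n11 + n12, n21 + n22]
    col_sums = [n11 + n21, n12 + n22]
    if min(row_sums + col_sums) == 0:
        raise ValueError("every row and column must contain at least one object")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical group of 45 constituencies, 34 of them among the 50 goal
# objects, out of 248 constituencies in total.
print(chi_square_evidence(34, 11, 16, 187))
```

Because the expected counts grow with the group size, a large group with a moderately high share of goal objects can reach a higher value than a small group with a very high share, which matches the remark that this function favours large groups.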

The second data set (financial services) has been explored using $V_2$. Since the data are
confidential, no concrete examples can be communicated, but in general the behaviour
resembled that of the election data (processed with $V_2$). Owing to the large sample the
values of $\chi^2$ were larger, but otherwise the differences between the dependent variables
(between parties or between financial services) were more pronounced than those between the
two data sets, e.g. regarding the distance between the first evidences or the number of
statements with an evidence at least a certain percentage of that of the best one.



3. RANK ORDER

3.1. Sorting according to the evidence

From information retrieval, it is well known that sorting the hits (if there are more than
a dozen or so) according to a suitable rank order is a big advantage to the user. Nothing
gets lost, but there is a chance that the first few documents already answer the posed question.
The ranking algorithms are based on the frequencies of the occurrences of the search terms
in the individual documents. Surprisingly, few commercial systems offer this feature.

In the case of acquiring statistically confirmed knowledge from databases it should also be
helpful to sort the results according to their strengths. Sometimes the task leads naturally to
a grouping; in this case, the results should be sorted according to strength within each group.
If one is looking for conspicuous election results, a grouping according to parties suggests
itself.

There is hardly a natural boundary between relevant statements and irrelevant ones.
Conventional significance boundaries are of little value since one performs conceptually
perhaps tens of thousands of tests that, in addition, are not independent of each other;
therefore, a formal significance boundary of 0.1% or less is at best a very rough guide.

Throughout the paper we assume that data exploration yields so many results that it is a
problem to extract the really interesting ones. The first thing to do is then to sort the search
results according to the evidence expressed by the evaluation function, possibly within groups
provided by the nature of the problem. In what follows we investigate refinements that better
reflect the interest of the user.



3.2. Actual variations to the evidence 

The evaluation function — perhaps several functions to be used at the discretion of the 
user— is installed in the EXPLORA knowledge discovery system either from the very beginning 
or when a particular database is initiated; it reflects the anticipated general interests of all 
users. 

However, for any session the user may have varying interests. As a partial remedy we now 
propose that the system provides means for selectively modifying the initial value of the 
evidence. This is a means for treating the second aspect of interestingness listed in Section 2.2:
the conspicuousness or evidence of a finding.

The user should certainly have a tool for deleting individual resulting statements that at 
present are of no interest. But this is not the point here. What is needed are means for 
increasing or decreasing the initial evidence for specified types of results corresponding to 
facets of interest. This can be done by adding or subtracting a fixed constant or by multiplying
with a constant near 1, this choice depending on the nature of the evidence. 

In order to have a reasonable effect, the constant must be big enough (or differ sufficiently
from 1) to change the order of statements considerably, but small enough not to dominate the
new order; the selected group of statements should in general neither displace all others nor
be pushed completely out of sight.

No search result gets lost, only the order is altered. This lessens the danger of perhaps 
unwillingly creating the outcome one is eager to verify, i.e. to lie with statistics. 

We now give examples of groups of statements for which the discovery system could provide 
tools to revalue or devalue the evidence. We assume an EXPLORA-like setting where 
statements describe conspicuous groups (subpopulations) of the population under
examination. 

The user modifies all statements that contain a particular variable. The reason for
devaluation could be that this variable is known to be unreliable or expensive to obtain, or,
more likely, the result rather than the cause of what is to be explained; similarly for
revaluation.

Statements containing just one variable are considered more valuable; therefore more
complex statements are degraded according to the number of variables.

Higher levels of a hierarchic variable are preferred; therefore the evidence of statements 
containing lower levels is decreased. 

For a metric or ordinal variable, interior intervals (intervals having a lower and an upper
boundary) are more complicated and usually more difficult to interpret than open intervals;
therefore they are degraded.

Sometimes there exists a temporal dimension in the data: some statements pertain to more
recent facts than others or some statements in a time-varying data set have gained importance
since a specified date, perhaps the day when this user last attempted to discover novelties. In
such a context newer results can be upgraded.

A database allows diverse types of evaluations, such as finding groups with positive and
negative deviations from the average or, alternatively, considering groups deviating at a certain
time or relative to a previous date. These can of course be considered separately, but, if the
statements are mixed, it may be desirable to favour one type over the other.

Interest being as variable as it is, there is probably no formalized way to test these ideas,
but they have mostly been tried out on the two databases described in Section 2.3. The author 
admits being biased, but it seems that these modifications of the original evidence work out 
quite well and show the expected behaviour. 

The exact values of the changes to the evidence are not critical; according to our experiences
it is advisable to use the same constant with all facets of interest and to provide for means to 
upgrade or degrade a statement or a type of statements more than once. 

For clarity, we repeat that these re- and devaluations reflect the momentary interests of the 
user; thus they cannot be incorporated into the original evidence. 
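The selective modification of the evidence described in this section can be sketched as follows. This is only an illustration under stated assumptions: the `Statement` class, the `revalue` function, the multiplicative form and the factor 0.9 are hypothetical and are not taken from EXPLORA.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Statement:
    variables: tuple   # names of the descriptive variables in the statement
    evidence: float    # current, possibly already modified, evidence


def revalue(statements: List[Statement],
            matches: Callable[[Statement], bool],
            factor: float) -> List[Statement]:
    """Multiply the evidence of matching statements by `factor` and re-sort.

    factor < 1 devalues, factor > 1 revalues. No statement is deleted;
    only the rank order changes.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    updated = [
        Statement(s.variables, s.evidence * factor if matches(s) else s.evidence)
        for s in statements
    ]
    return sorted(updated, key=lambda s: s.evidence, reverse=True)


# Hypothetical statements; complex statements (more than one variable)
# are degraded, and the same constant is applied a second time.
found = [
    Statement(("Agri",), 10.0),
    Statement(("Agri", "Unempl"), 10.5),
    Statement(("Dens",), 9.0),
]
is_complex = lambda s: len(s.variables) > 1
once = revalue(found, is_complex, 0.9)
twice = revalue(once, is_complex, 0.9)
print([s.variables for s in twice])
```

Using one constant for every facet and allowing repeated application, as suggested above, keeps the tool simple: a single application moves the two-variable statement from first to second place, a second application moves it to the end, and it never disappears.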



4. DEVALUATION DUE TO AFFINITY

4.1. Affinity of statements

We have already mentioned that a statement may be uninteresting because another one
which is quite similar is already known or accepted; the interest focuses on novel and dissimilar
results as introduced in the third aspect of interestingness in Section 2.2.

The word 'similarity' is often used in connection with a distance obeying the triangle
inequality, which is not assumed here to hold; we therefore prefer 'affinity'.

We assume now that for any pair of statements $A_i$ and $A_j$ an affinity $S(A_i, A_j)$, or $S(i, j)$
for short, is given. This affinity is to be normalized to $0 \le S \le 1$. $S(A_i, A_j) = 0$ means there
is no relationship or effect; knowing $A_j$ does not influence the interestingness of $A_i$.
$S(A_i, A_j) = 1$ signifies the strongest possible influence of $A_j$ on $A_i$. If in addition the evidence
for $A_i$ is considerably lower than that for $A_j$, $A_i$ becomes uninteresting.

This formulation is by design asymmetric, for we do not even assume
$S(A_i, A_j) = S(A_j, A_i)$. We now give two examples.



4.2. Examples for affinity 

The first evidence function, $V_1$, aims at finding groups of objects (election districts in our
case) with a high percentage of goal objects. Clearly, subgroups of such a group will in general
also comprise a high portion of goal objects; thus, they are of little interest. On the other hand,
a larger group can be interesting even if the portion of goal objects is smaller. This situation
demands an asymmetric affinity. We chose

$$S_1(A_i, A_j) = |M_i \cap M_j| / |M_i|$$

where $M_i$ is the group described by statement $A_i$ and $|M|$ is a measure for the size of group $M$.
If the user is looking for a description of
the election results, $|M|$ could be the number of goal objects of $M$. Two groups containing
the same goal objects have affinity one; it does not matter which one we take as far as the
affinity is concerned: the decision rests on the evidence only. If, however, the user is searching
for an explanation, both groups are of interest in the case where their non-goal objects are
different even if one of them has a higher evidence.
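The asymmetry of this affinity is easy to see in a small sketch. The function name `affinity_s1` and the use of Python sets for groups are illustrative assumptions; the size measure here is simply the number of elements, so passing only the goal objects of each group gives the goal-object variant mentioned above.

```python
def affinity_s1(group_i: set, group_j: set) -> float:
    """Overlap of the two groups relative to the size of the first group.

    Read as the influence of the second statement on the first: it is 1
    whenever group_i is contained in group_j.
    """
    if not group_i:
        raise ValueError("group_i must not be empty")
    return len(group_i & group_j) / len(group_i)


# Hypothetical groups: a subgroup of three objects inside a group of six.
subgroup = {1, 2, 3}
group = {1, 2, 3, 4, 5, 6}
print(affinity_s1(subgroup, group))  # 1.0: the subgroup is fully covered
print(affinity_s1(group, subgroup))  # 0.5: the larger group is only half covered
```

The subgroup is thus strongly affected by the larger group, while the larger group is affected only in proportion to the part it shares with the subgroup.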

In the case of the second evidence, $V_2$, it seems appropriate to base the affinity also on $\chi^2$.
The rows and columns of the contingency table now correspond to the two groups and their
complements. Since $S$ has to be normalized, we divide by the highest possible value, which is
$N$ (the total number of objects):

$$S_2(A_i, A_j) = \chi^2 / N$$

If both groups coincide, then $S_2 = 1$.

If $S$ is an affinity function, then so is $S^x$ for $x > 0$. Small values of $x$ emphasize
any similarities that are present, while large values restrict the influence of the affinity to groups
that are equal or nearly equal.

In a similar context (Section 5 of Reference 14) it has been shown that one should use $S_2^x$
with $x < 0.5$. This recommendation is based on an analysis of the case when one group is a
subgroup of the other; then the larger one should be uninteresting if its deviation from the
overall mean results purely from the deviation of the subgroup.

4.3. Devaluation algorithm 

We shall return now to the task of presenting to the user those statements which are
supposedly most interesting.

In principle, all conspicuous statements are shown, but the most interesting ones should be
shown first. However, the user is free to stop the presentation when a chosen limit is reached. 

The primary order is given by the evidence. But this order has to be altered: if statement
$A_j$ depends highly on statement $A_i$ but has a markedly lower evidence, then it is much less
interesting than other statements with evidences similar to $V(A_j)$ but smaller relationships
(affinities) to $A_i$. If however $V(A_i)$ and $V(A_j)$ are (nearly) equal, the two statements should
not affect one another.

Roughly speaking, statement $A_i$ reduces the effective evidence of statement $A_j$ by an
amount that depends on the affinity and on the distance between the evidences. One has to
make sure, however, that a statement $A_i$ that has thus been heavily devalued does not influence
another whose evidence is well below $V(A_i)$ but above the devalued evidence for $A_i$. This
suggests the following rules, which may at first seem somewhat complicated but are not, as the
subsequent algorithm shows.

We introduce restricted evidences $R_i(A_j)$ and $R(A_j)$ by means of

$$R_i(A_j) = V(A_j)\,[V(A_j)/R(A_i)]^{\delta S(A_j, A_i)}$$
$$R(A_j) = \min_i R_i(A_j)$$

with a free parameter $\delta$ to be chosen by the user. $R_i$ is the result of $A_i$ devaluing $V(A_j)$; the
final restricted evidence is the minimum with respect to $i$. If for some $i$ the square brackets have
a value greater than 1, this $i$ cannot yield the minimum and can be ignored from the beginning.
The algorithm for computing the restricted evidences proceeds in four steps. 

1. For all statements $A_j$, let $R(A_j) = V(A_j)$.

2. Among all statements not yet presented to the user, let $A_i$ be that one with the highest
reduced evidence $R(A_i)$. It is now presented to the user.

3. For the remaining statements, the reduced evidence is updated:
$R(A_j) = \min(R(A_j), R_i(A_j))$.

4. Return to step 2 unless all statements have been presented or a user-defined stop criterion
has been reached.
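The four steps can be sketched in a few lines. This is a sketch under stated assumptions, not the EXPLORA implementation: the devaluation rule is taken to be R_i(A_j) = V(A_j) * (V(A_j) / R(A_i)) ** (delta * S(A_j, A_i)), in agreement with the logarithmic form given in Section 4.4; `delta` stands for the free parameter and `x` for the affinity exponent discussed after the algorithm; the names `devaluation_order`, `affinity` and `limit` are hypothetical, and the user-defined stop criterion is reduced to a maximum number of statements.

```python
from typing import Callable, List, Optional, Sequence


def devaluation_order(evidence: Sequence[float],
                      affinity: Callable[[int, int], float],
                      delta: float,
                      x: float = 1.0,
                      limit: Optional[int] = None) -> List[int]:
    """Return statement indices in the order in which they are presented.

    evidence[j] is V(A_j) > 0; affinity(j, i) is S(A_j, A_i) in [0, 1],
    i.e. the influence of statement i on statement j.
    """
    reduced = list(evidence)                  # step 1: R(A_j) = V(A_j)
    remaining = set(range(len(evidence)))
    order: List[int] = []
    while remaining and (limit is None or len(order) < limit):   # step 4
        # step 2: highest reduced evidence, ties broken by lowest index
        i = max(remaining, key=lambda k: (reduced[k], -k))
        remaining.remove(i)
        order.append(i)
        for j in remaining:                   # step 3: update the rest
            ratio = evidence[j] / reduced[i]
            if ratio < 1.0:                   # ratios >= 1 cannot lower the minimum
                r_ij = evidence[j] * ratio ** (delta * affinity(j, i) ** x)
                reduced[j] = min(reduced[j], r_ij)
    return order


# Hypothetical data: statement 1 is nearly a copy of statement 0,
# statement 2 is unrelated to both.
S = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
print(devaluation_order([10.0, 9.0, 8.0], lambda j, i: S[j][i], delta=2.0))
# [0, 2, 1]
```

In the example, statement 1 is devalued from 9 to 7.29 by statement 0 and therefore falls behind the unrelated statement 2; with `delta=0` the original order by evidence is kept.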

In this algorithm, the user's interests can be quantified in two ways.

Choosing an exponent $x$ with the affinity determines what constitutes a high affinity. Assume
a total of 248 objects, as in the election data, and two groups of 20 objects each with 15 objects
in common. This yields $S_2 = 0.53$. If one considers these groups to be rather similar, one could
use $x = 0.5$, leading to the effective affinity $S_2^x = 0.73$. If one wants by and large to disregard
such situations, $x = 2$ results in $S_2^x = 0.28$. This cuts the influence of one group on the other
in half; but both operations have little effect on affinities close to 1.
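The figures of this example can be reproduced with a short sketch, assuming that the chi-square affinity is the chi-square statistic of the 2x2 table of the two groups and their complements, divided by the total number of objects, and raised to the exponent x. The function name `affinity_s2` and its parameters are illustrative; the closed form used for the 2x2 chi-square is the standard one.

```python
def affinity_s2(size_i: int, size_j: int, common: int, total: int,
                x: float = 1.0) -> float:
    """(chi-square / N) ** x for two groups described by their sizes."""
    a = common                              # objects in both groups
    b = size_i - common                     # only in the first group
    c = size_j - common                     # only in the second group
    d = total - size_i - size_j + common    # in neither group
    if min(a, b, c, d) < 0:
        raise ValueError("inconsistent group sizes")
    marginals = (a + b) * (c + d) * (a + c) * (b + d)
    if marginals == 0:
        raise ValueError("groups must be neither empty nor the whole population")
    # For a 2x2 table, chi-square / N = (ad - bc)^2 / product of the marginals.
    return ((a * d - b * c) ** 2 / marginals) ** x


# 248 objects, two groups of 20 objects with 15 objects in common.
print(round(affinity_s2(20, 20, 15, 248), 2))          # 0.53
print(round(affinity_s2(20, 20, 15, 248, x=0.5), 2))   # 0.73
print(round(affinity_s2(20, 20, 15, 248, x=2.0), 2))   # 0.28
```

For two coinciding groups the function returns 1 for every exponent, which illustrates why changing x has little effect on affinities close to 1.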

The second choice for the user is $\delta$; $\delta$ determines how far one statement devalues another
that is highly related (affinity close to 1). $\delta$ should be large enough to produce a striking effect
except perhaps in the first half dozen statements (these very often have quite similar evidences
such that the devaluation remains small). In our test data, the range from 1 to 4 turned out
to be a good choice.

Two things should be stressed again. The exact values of $x$ and $\delta$ are not important;
changing them slightly will switch here and there the order of two statements, but nothing
becomes lost. The whole procedure might look rather like 'lying with statistics'; but we are
here in the context of exploratory statistics and knowledge discovery and the aim is not to
prove something statistically but to find relationships that are interesting enough for further
study and confirmation.

4.4. Limiting behaviour

In order to get an impression of the results that can be expected, we shall discuss the
behaviour of the devaluation algorithm for extreme values of $\delta$ and $x$. We assume without loss
of generality that the statements are indexed according to their evidences, i.e.
$V(A_i) \ge V(A_{i+1})$ for all $i$. To simplify the discussion, we assume furthermore that all
evidences are different, i.e. $V(A_i) > V(A_{i+1})$, unless indicated otherwise.

Let $\delta$ be fixed and assume $0 < S(A_i, A_j) < 1$ for all $i$ and $j$ ($i \ne j$). If $x \to 0$, $S(A_i, A_j)^x \to 1$.
Then the devaluation of $A_j$ is determined by the smallest quotient $V(A_j)/R(A_i)$; this is clearly
$V(A_j)/V(A_1)$ since $A_1$ is not devalued at all and all other original evidences, and therefore
all reduced evidences, are smaller than, or at most equal to, $V(A_1)$. Thus all statements are
devalued according to their quotients $V(A_j)/V(A_1)$, and that leaves their order unchanged.
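
That the order is preserved in this limit can be checked directly. As a short derivation, assuming the devaluation rule $R_i(A_j) = V(A_j)\,[V(A_j)/R(A_i)]^{\delta S(A_j, A_i)}$ with $\delta \ge 0$ and every affinity replaced by its limit 1:

$$R(A_j) = \min_i V(A_j)\,[V(A_j)/R(A_i)]^{\delta} = V(A_j)\,[V(A_j)/V(A_1)]^{\delta} = V(A_j)^{1+\delta} / V(A_1)^{\delta}$$

The minimum is attained for $i = 1$ because $R(A_i) \le V(A_1)$ for every $i$. Since $V(A_1)$ is a constant and $1 + \delta > 0$, $R(A_j)$ is a strictly increasing function of $V(A_j)$, so $V(A_j) > V(A_k)$ implies $R(A_j) > R(A_k)$.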

If for some $j$, $S(A_j, A_1) = 0$, then $A_j$ is not devalued by $A_1$, but it may of course be
devalued by some other $A_i$. However, we then have

$$V(A_j)/R(A_i) > V(A_j)/V(A_1)$$

so that $A_j$ is devalued less than statements with similar evidence but positive affinity to $A_1$.
Therefore $A_j$ can advance to an earlier position; of course it depends on the other evidences
and affinities as to which place it will occupy.

Now let $x \to \infty$, implying $S(A_i, A_j)^x \to 0$. Then no devaluation takes place; the order is again
unchanged. If, however, for some $i$ and $j$, $V(A_i) > V(A_j)$ and $S(A_j, A_i) = 1$, then statement
$A_j$ is devalued. Thus statements that have affinity 1 with another statement with higher
evidence can be pushed down in the list in this limiting case.

In summary, only certain statements are affected with regard to their order in the limiting
cases $x \to 0$ and $x \to \infty$, but in between the order may change severely. There is however no way
to predict which value of $x$ has the largest effect; in fact, as the test data have shown, this can
be different with similar data such as different dependent variables within the same data set
using the same type of evidence and affinity and the same independent variables.

Next let $x$ be fixed and $\delta$ vary. For $\delta = 0$, the evidence is unchanged: $R(A_j) = V(A_j)$. To
show the limiting behaviour for $\delta \to \infty$, let us take logarithms of the definition of $R_i(A_j)$:

$$\log R_i(A_j) = \log V(A_j) + \delta\, S(A_j, A_i) \log [V(A_j)/R(A_i)]$$

If $A_j$ is devalued by $A_i$ at all (i.e. $S(A_j, A_i) \ne 0$ and $V(A_j)/R(A_i) < 1$), then the second term
(which is negative at least for $i = 1$) essentially determines $R_i$. Thus the final order is that given
by

$$S(A_j, A_1) \log [V(A_j)/R(A_1)]$$

with some obvious modifications if there exist more statements with the same evidence as $A_1$.
If $j$ is the smallest index such that $S(A_j, A_1) = 0$, then $A_j$ is not devalued by $A_1$ and for
$\delta$ large enough, all other statements will be devalued below $R(A_j)$. Thus the final order is given
by $A_1$, then all statements with affinity 0 with $A_1$ and between one another, and finally all other
statements ordered according to

$$S(A_j, A_1) \log [V(A_j)/R(A_1)]$$

In our test data, one needs unreasonably large values for $\delta$ to approach this limiting order
for the first few dozen statements, often $\delta > 100$ or even $\delta > 1000$ for useful values of $x$. But
this analysis demonstrates the tendency of the effect of large $\delta$ to favour statements that are
(nearly) unrelated to the first statement and to each other.



4.5. Sample results 

As a demonstration of the procedure, we look at the 30 best constituencies of the party 'Die
Grünen' according to the percentage of votes in 1987 (this means at least 11.2%). In this case
we used the evidence function $V_1$ emphasizing groups with a high percentage of goal objects;
we restricted the search to groups containing at least 8 out of the 30 districts. The affinity
function for this example was $S_1$ based on all objects.

Among the groups that turned out to be of interest are those shown in Table I.
The meaning of the descriptive variables is as follows:

Cath     proportion of the population who are Roman Catholics
Dens     population density (inhabitants per square kilometre)
Emp      proportion of the population who are employed
Old      proportion of the population who are above 65 years of age
Prod     proportion of the working population who are in productive trades (industry)
Serv     proportion of the working population who are in service trades (%); highly
         correlated with Prod (Prod, Serv and agriculture add up to 100%)
Unempl   proportion of persons willing to work who are unemployed (%)
Young    proportion of the population who are of age 18 to 25 years (%)

Some descriptions cover identical districts: groups 6 and 7, 9 and 10, 13 and 14, 17 and 18.
This is due to the high correlation between Prod and Serv; Serv > 60 (46 objects, 24 goal
objects) is a subset of Prod ≤ 40 (49 objects, 24 goal objects).

Table II shows the first ten groups that are offered by the devaluation algorithm with various
values of the free parameters $\kappa$ and $\delta$. It is not recommended to use $\kappa$ as low
as 0.1 or as high as 20, or $\delta = 8$; these lines are added to demonstrate what happens in
the limit.

Groups 56, 63, 92, 101 and 126 have no elements in common with group 1 (nor with groups
2 to 5). In addition, groups 56 and 63 are fairly different (only 11 districts in common), while
92, 101 and 126 are similar to 56. The dissimilarity to groups 1 and 2 is of course the reason
they show up at low levels of $\kappa$. The surprising property of groups 63 and 101 is that they
have a high unemployment rate. They are, however, almost subsets of group 6.

$\delta = 1$ shows little effect (as was to be expected). With $\delta = 2$ and $\delta = 4$,
group 2 maintains its place due to the small difference in evidence to group 1. Groups 3 to 5 are
also quite similar to group 1, but their lower evidences result in stronger devaluations. Groups
17 and 18 are much larger than group 1 and less similar to group 1 than, for example, group 15;
therefore they are only slightly devalued and show up at early places at several parameter
combinations.

Next let us look at the same data with a different affinity function, $S_1$ based on goal elements
only, i.e. $|M|$ is the number of goal elements in group $M$. Table III shows the first ten
statements offered by the devaluation algorithm for $\kappa$ = 0.5, 1, 2 and 4 and $\delta$ = 1, 2
and 4.

At first glance, the differences to Table II are only minor. Some statements appear at
somewhat earlier places, in particular groups 56 and 63. Actually, their affinity to group 1 is
0 both times, so they have not changed. But some other groups have been devalued more
heavily.

A good example is group 9. Five of its objects also belong to group 1; all five are goal
objects. Thus $S_1(A_9, A_1) = 1/3$ when based on all objects and 1/2 when based on the goal
objects.

Similarly, $S_1(A_6, A_1)$ rises from 0.3 to 0.42. When $A_6$ is only slightly devalued, it devalues
in turn group 56 (affinity 0.87 and 1.0, respectively); but as soon as it falls below $V(A_{56})$,
group 56 remains unaffected, which is more often the case in Table III than in Table II.
Remember that $A_{56}$ has no elements in common with $A_1$ to $A_5$.
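
The affinities quoted above can be checked by hand. The sketch below assumes that $S_1(A_j, A_i)$ is the share of the objects of the devalued group $A_j$ that also belong to $A_i$; this form is an assumption chosen because it reproduces the quoted values, not a definition taken from this section. Group sizes come from Table I. Only the overlap of groups 9 and 1 (five objects, all of them goal objects) is stated in the text; the overlaps used for groups 6 and 1 and for groups 56 and 6 are hypothetical values back-calculated from the quoted affinities.

```python
from fractions import Fraction

# (objects, goal objects) per group, as listed in Table I.
SIZES = {1: (12, 10), 6: (30, 19), 9: (15, 10), 56: (15, 9)}

# (shared objects, shared goal objects) for each pair (devalued, devaluing).
# Only the pair (9, 1) is stated in the text; the other two overlaps are
# hypothetical values back-calculated from the quoted affinities.
OVERLAPS = {(9, 1): (5, 5), (6, 1): (9, 8), (56, 6): (13, 9)}


def affinity(devalued: int, devaluing: int, goal_only: bool) -> Fraction:
    """Assumed S_1: fraction of the devalued group shared with the devaluing group."""
    idx = 1 if goal_only else 0
    shared = OVERLAPS[(devalued, devaluing)][idx]
    return Fraction(shared, SIZES[devalued][idx])


for pair in OVERLAPS:
    all_objects = affinity(*pair, goal_only=False)
    goal_objects = affinity(*pair, goal_only=True)
    print(pair, f"{float(all_objects):.2f}", f"{float(goal_objects):.2f}")
```

Under these assumptions, restricting the count to goal objects shrinks the denominator faster than the overlap, which is why the affinities to group 1 rise and groups 9 and 6 are devalued more heavily.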

The procedure has been tried out with various combinations of the 30 or 50 or 62 best or
worst constituencies based either on the percentages of votes or the gains and losses of the four
major parties. Rarely does a statement with a number above 100 appear on the first ten places



Table I. Interesting groups of constituencies. Goal objects are the best 30
constituencies of 'Die Grünen' in the Federal elections of 1987

No.   Description                 Objects    Goal objects   Evidence
                                  in group   in group

  1   Prod < 45, Unempl < 6          12          10          0.714
  2   Serv > 55, Unempl < 6          11           9          0.692
  3   Old > 23.5, Unempl ≤ 6         12           9          0.643
  4   Empl > 45, Unempl ≤ 6          14          10          0.625
  5   Dens > 1000, Unempl < 6        13           9          0.600
  6   Empl > 45, Prod ≤ 40           30          19          0.594
  7   Empl > 45, Serv > 60           30          19          0.594
  8   Dens > 2000, Empl > 45         25          16          0.593
  9   Cath > 60, Prod < 40           15          10          0.588
 10   Cath > 60, Serv < 60           15          10          0.588
 12   Serv > 50, Unempl < 6          15          10          0.589
 13   Old > 23.5, Prod < 40          27          17          0.586
 14   Old > 23.5, Serv > 60          27          17          0.586
 15   Empl > 45, Serv > 55           34          21          0.583
 16   Young < 12, Serv > 60          29          18          0.581
 17   Dens > 400, Prod < 40          36          22          0.579
 18   Dens > 400, Serv > 60          36          22          0.579
 29   Dens > 20, Serv > 60           41          24          0.558
 56   Cath < 20, Empl > 45           15           9          0.529
 63   Empl > 43, Unempl > 10         17          10          0.526
 92   Dens > 1000, Cath < 20         16           9          0.500
101   Empl > 45, Unempl > 8          20          11          0.500
126   Empl > 40, Cath < 20           17           9          0.474

Table II. Identification numbers of groups in the order produced by the
devaluation algorithm for various values of $\kappa$ and $\delta$. The affinity $S_1$ is based on
all objects

$\kappa$   $\delta$   First 10 groups

 0.1     1      1    2    3    4   56   63    5    6    7    8
 0.1     2      1    2   56    3   63    4  101   92    …    …
 0.1     4      1    2   56    …    3  101   92    4  126    6
 0.1     8      1    2   56    …  101   92    3    4  126    6
 0.5     1      1    2    3    4   56    6    7   63    8    9
 0.5     2      1    2   56   63    3    4    6    7    8    …
 0.5     4      1    2   56    …    3  101   92    6    7    4
 0.5     8      1    2   56    …  101   92    …    6    7    4
 1       1      1    2    3    6    7    4    8   17   18   16
 1       2      1    2   56   63    3    6    7   17   18   16
 1       4      1    2   56   63    6    7   17   18    3   16
 1       8      1    2   56   63  101    6    7   92   17   18
 2       1      1    2    3    6    7    8   17   18   16   15
 2       2      1    2    6    7   17   18   16   15    8   13
 2       4      1    2   56    6    7   63   17   18   16   15
 2       8      1    2   56   63   17   18    6    7   16   29
 4       1      1    2    6    7    8    9   10   13   14    3
 4       2      1    2    6    7    8   15   13   14    9   10
 4       4      1    2    6    7   17   18   16   15    8   13
 4       8      1    6    7   17   18   16   15   13   14    2
20       1      1    2    6    7    8    9   10   15    3   13
20       2      1    2    6    7    8    9   10   15   17   18
20       4      1    2    6    7    8    9   10   15   17   18
20       8      1    6    7    8    9   10   17   18   19   15


Table III. Identification numbers of groups in the order produced by the
devaluation algorithm for various values of $\kappa$ and $\delta$. The affinity $S_1$ is based on
the goal objects

$\kappa$   $\delta$   First 10 groups

 0.5     1      1    2    3    4   56    6    7   63    8    9
 0.5     2      1    2   56   63    3    4    6    7    8  101
 0.5     4      1    2   56   63    3  101   92    6    7    4
 1       1      1    2    3    6    7    4    8   17   18   16
 1       2      1    2   56   63    3    6    7   17   18   16
 1       4      1    2   56   63    6    7   17   18    3   16
 2       1      1    2    3    6    7    8   17   18   16   15
 2       2      1    2    6    7   17   18   16   15    8   13
 2       4      1    2   56    6    7   63   17   18   16   15
 4       1      1    2    6    7    8    9   10   13   14    3
 4       2      1    2    6    7    8   15   13   14    9   10
 4       4      1    2    6    7   17   18   16   15    8   13

as is the case above. Usually small changes occur in the first four to eight places, but quite
often one or two statements with numbers between 20 and 50 show up among the first ten.
Similar experiences have been gained with the election data using $V_1$ and $S_2$ as well as with
the financial data.

4.6. Actual variations to the affinity

Just as the evidence can, the affinity can be modified to reflect the actual interests of the user.
It seems, however, that alterations are less beneficial here.

With our test data, only one situation arose where alterations to the affinity helped clarify
the analysis. The financial data included an income variable with quite a number of partitions;
all income intervals were candidates for statements or constituents thereof. Thus, some
dependent variables produced many results containing income and another variable, e.g.
education.

This annoyance can be reduced if all affinities between statements using the same variables
are increased, e.g. replacing $S(A_i, A_j)$ by $S(A_i, A_j)^{1/2}$ or even by $S(A_i, A_j)^{1/4}$.
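
The effect of such a replacement is easy to see numerically. The following minimal sketch assumes that affinities lie in [0, 1], so that any exponent below 1 raises them, and that the exponents meant here are 1/2 and 1/4. It also reads "using the same variables" as "using identical sets of variables". The helper `adjusted_affinity` and its arguments are illustrative names, not part of EXPLORA.

```python
def adjusted_affinity(s, vars_a, vars_b, exponent=0.5):
    """Raise the affinity s when both statements use the same variables.

    Assumes 0 <= s <= 1 and 0 < exponent <= 1, so that s ** exponent >= s.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("affinity must lie in [0, 1]")
    if not 0.0 < exponent <= 1.0:
        raise ValueError("exponent must lie in (0, 1]")
    if frozenset(vars_a) == frozenset(vars_b):
        return s ** exponent
    return s  # statements on different variables keep their affinity


# Two statements built from the same pair of variables (hypothetical names).
same = ({"income", "education"}, {"education", "income"})
for s in (0.04, 0.25, 0.81):
    print(s, adjusted_affinity(s, *same), adjusted_affinity(s, *same, exponent=0.25))
```

Small affinities gain the most from the transformation, so weakly overlapping statements on the same variables are devalued noticeably more than before, while affinities of 0 and 1 are unchanged.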

Another potential application is the mixture of two or more types of statement,
in the simplest case describing groups where the dependent variable is above or below average.
Here the affinities within a type could be increased (or those between types decreased, which
in conjunction with a suitable new $\kappa$ amounts to the same thing).



5. CONCLUSIONS

Exploratory data analysis tries to boil huge data sets down to a number of statements
concerning data of potential interest to the user. There already exists, of course, a diversity
of methods for detecting interesting features in large data sets; these extend from elementary
tools in exploratory data analysis 4 like stem-and-leaf displays and boxplots over sophisticated
procedures for uncovering low-dimensional nonlinearities in high-dimensional data clouds
(projection pursuit) 7,8 to specialized algorithms for detecting numerical laws of quite complex
structure 9,10 as well as other machine learning methods. 11,12

Here, we have been concerned with the situation in which a search algorithm like that of
EXPLORA has extracted a number of potentially noteworthy statements but this set is still too
large to detect easily the really interesting results. In further reducing this set, or rather in
ranking it, we had to reconcile the two conflicting aspects of interestingness, viz.
conspicuousness and novelty.

The solution presented above draws on two ideas. 

Firstly, statements that are quite similar to a better one are devalued, i.e. pushed further
down on the list of results. In order to do so, one needs a measure for the importance or worth
of a statement, its evidence, and a measure for 'similar', the affinity. The devaluation
algorithm then solves the task.

Secondly, the user should be enabled to cope with varying needs and individualistic,
subjective facets of interestingness. This is accomplished by the choice of free parameters
($\kappa$ and $\delta$) and by temporarily altering, if necessary, individual evidences, groups of
evidences or groups of affinities.

The obvious objection to such a procedure is that it involves the danger of manipulating the
results. Therefore we stress again that we are here in the context of exploration. The tests are
in no way statistically valid: the test procedure will usually be quite rough and the immense
number of conceptually performed tests renders all nominal significance levels obsolete
anyway. As a result, a meaningful procedure for stressing certain statements and lowering the
weight of others should be beneficial rather than damaging.

Extensive trials with two quite different data sets have, in the author's opinion, verified the
viability and usefulness of the proposed procedure. Unfortunately, resources did not permit
the validation of the procedure with external users.

The selection procedure has been adapted to the problem of selecting interesting regression
models from the set of all submodels of the full regression model. 13 In this case, the
conspicuousness rests on the opposing criteria of high multiple correlation and low number of
independent variables (scarcity of the model); the affinity is determined by the variables that
two models have in common.